Yao Yao
Here at Royal Caribbean Cruises Ltd. we are committed to reducing our environmental footprint. Improving energy efficiency is a critical element of our quest for sustainability. Royal Caribbean Cruises Ltd. takes pride in being a leader in using new technologies to design and build more energy-efficient ships. In one of our testing labs we have installed high-efficiency appliances as well as LED lightbulbs.
As part of our lab testing, you have been tasked with analyzing and creating a prediction model for the amount of energy used by these rooms. You have been supplied with the data set and a description of its fields. This is an open challenge where we want to see your ability to formulate and solve a problem, as well as your creativity. Below are some of the things we are looking for:
import sys
try:
    sys.getwindowsversion()
except AttributeError:
    isWindows = False
else:
    isWindows = True
if isWindows:
    import win32api, win32process, win32con
    pid = win32api.GetCurrentProcessId()
    handle = win32api.OpenProcess(win32con.PROCESS_ALL_ACCESS, True, pid)
    win32process.SetPriorityClass(handle, win32process.HIGH_PRIORITY_CLASS)
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score, r2_score
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics
from datetime import datetime
import time
from sklearn.model_selection import KFold
import itertools
import datetime as dt
from ggplot import *
%matplotlib inline
Based on the data set description, the target variable is total consumption in watt-hours, predicted from the day, time, month, and the temperature and humidity in 9 different rooms as well as outside. Outside pressure, visibility, and wind speed are also given as features. We are therefore solving a regression problem for a continuous target, determining which features contribute the most to total consumption in watt-hours, and perhaps how to lower total consumption based on those variables. I elected to use decision tree and random forest regressors because they tend to handle a relatively sparse dataset better than linear or polynomial regression. As I will illustrate below, the stepwise branches, nodes, and leaves yield better accuracy.
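The intuition that stepwise splits handle threshold-like behavior better than a single continuous fit can be illustrated on a small synthetic example (a toy sketch, not the supplied dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Toy target with a hard threshold at x = 0.5
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.where(X[:, 0] > 0.5, 10.0, 2.0)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
lin = LinearRegression().fit(X, y)

# A shallow tree recovers the step exactly; a straight line cannot
print('tree R^2:', tree.score(X, y))
print('linear R^2:', round(lin.score(X, y), 3))
```

The tree needs a single split at the threshold, while the best-fit line is forced to smear the jump across the whole range.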
from IPython.display import HTML, display
import tabulate
table = [["date","time year-month-day hour:minute:second"],
["TotalConsmp (AC+TV+LED+Peripherals)","energy use in Wh (Target)"],
["R1","Temperature in Room 1 in Celsius"],
["H_1","Humidity Room 1 in %"],
["R2","Temperature in Room 2 in Celsius"],
["H_2","Humidity in Room 2 in %"],
["R3","Temperature in Room 3 in Celsius"],
["H_3","Humidity in Room 3 in %"],
["R4","Temperature Room 4 in Celsius"],
["H_4","Humidity in Room 4 in %"],
["R5","Temperature in Room 5 in Celsius"],
["H_5","Humidity in Room 5 in %"],
["R6","Temperature Room 6 in Celsius"],
["H_6","Humidity in Room 6 in %"],
["R7","Temperature in Room 7 in Celsius"],
["H_7","Humidity in Room 7 in %"],
["R8","Temperature in Room 8 in Celsius"],
["H_8","Humidity in Room 8 in %"],
["R9","Temperature in Room 9 in Celsius"],
["H_9","Humidity in Room 9 in %"],
["To","Temperature outside in Celsius"],
["Pressure outside","in mm Hg"],
["RH_out","Humidity outside in %"],
["Windspeed","in m/s"],
["Visibility","in km"]]
display(HTML(tabulate.tabulate(table, tablefmt='html')))
df=pd.read_csv("data.csv", engine = 'c')
print("Table 2: Summary of Values")
df.describe().T
Day vs. night, monthly, and weekly dummy variables were made but ultimately not used, because the decision tree process provides better stepwise cutoffs and thresholds for determining how features predict total consumption.
The features that contribute most to energy consumption are illustrated later by the feature importances of the best decision tree model.
First, look for any missing values and use Monte Carlo imputation to fill NAs; luckily, the dataset is complete. Decision trees and random forests are also more lenient with outliers: the threshold system places extreme values into their own groupings rather than letting them pull on the overall fit, as they would in linear or polynomial models. In some cases outliers are even preferred, because trees can pick up on nuances in the dataset that continuous-equation models would miss during fitting.
df.isnull().T.any().T.sum()
Then, split the date into date and time columns
df[['date', 'time']] = df['date'].str.split(' ', n=1, expand=True)
Convert the time into continuous hours from 0 to 24
df['time']= df['time'].str.split(':').apply(lambda x: int(x[0]) + int(x[1])/60)
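A quick sanity check of that conversion on standalone strings (a small sketch mirroring the lambda above, independent of the dataframe):

```python
# "hh:mm:ss" -> fractional hours, same arithmetic as the apply() above
def to_hours(t):
    h, m = t.split(':')[:2]
    return int(h) + int(m) / 60

print(to_hours('13:30:00'))  # 13.5
print(to_hours('00:15:00'))  # 0.25
```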
Since the observations start at the beginning of 2016, use 1/1/2016 as the reference day and count the days elapsed since then.
def compare_dates(date):
    date_format = '%m/%d/%Y'
    current_date = datetime.strptime(date, date_format)
    start = datetime(2016, 1, 1)
    diff = current_date - start
    return diff.days
#apply this function to your pandas dataframe
df['Days_Since_1_2016'] = df['date'].apply(compare_dates)
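A spot check of the same arithmetic on a known date (a standalone sketch of what compare_dates computes):

```python
from datetime import datetime

# Same logic as compare_dates above: days elapsed since 1/1/2016
diff = datetime.strptime('1/15/2016', '%m/%d/%Y') - datetime(2016, 1, 1)
print(diff.days)  # 14
```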
To create the weekday variable, convert the date into datetime; the numbers 1 to 7 denote the days of the week.
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df['weekday'] = df['date'].dt.dayofweek + 1
Similarly, convert the months into numerical values from 1 to 12.
df['month']=df['date'].dt.month
df = df.drop(columns='date')
Create a dummy variable flagging weekends.
df['weekend'] = np.where(df['weekday'] > 5, 1, 0)
Summary of columns after new ones are created and separated
print("Table 3: Summary of Values")
df.describe().T
Total consumption has a right skew and is corrected by logging the values
log_columns = ['TotalConsmp']
for column in log_columns:
    df[column + '*'] = np.log(df[column] + 1)
df = df.drop(columns=log_columns)
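Because the target is transformed with log(x + 1), any prediction made on the log scale can be mapped back to watt-hours with np.expm1, its exact inverse (a small round-trip sketch with made-up values):

```python
import numpy as np

vals = np.array([0.0, 50.0, 1000.0])   # example consumption in Wh
logged = np.log(vals + 1)              # same transform as applied to TotalConsmp
restored = np.expm1(logged)            # exp(x) - 1 inverts it

print(np.allclose(restored, vals))  # True
```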
The heat index combines humidity and temperature into one overall temperature: https://en.wikipedia.org/wiki/Heat_index#Formula. Assuming humans occupy the rooms, the "feeling" of temperature can be greatly influenced by how humid a room is. An equation is therefore used to determine whether overall temperature could influence heating, which would contribute to overall energy consumption. Because the heat index formula works in Fahrenheit, the temperatures have to be converted to Fahrenheit and then back, since the rest of the dataset uses metric units.
40 * 1.8 + 32  # quick check of the Celsius-to-Fahrenheit conversion: 40 C = 104 F
As an example, if the temperature is 85 degrees F and the room has a humidity of 70%, the overall heat index value is 92.7, which checks out using this online calculator for the heat index: https://www.calculator.net/heat-index-calculator.html
c_1 = -42.379
c_2 = 2.04901523
c_3 = 10.14333127
c_4 = -0.22475541
c_5 = -6.83783 * 10**-3
c_6 = -5.481717 * 10**-2
c_7 = 1.22874 * 10**-3
c_8 = 8.5282 * 10**-4
c_9 = -1.99 * 10**-6
T = 85
R = 70
HI = c_1 + c_2 * T + c_3 * R + c_4 * T * R + c_5 * T**2 + c_6 * R**2 + c_7 * T**2 * R + c_8 * T * R**2 + c_9 * T**2 * R**2
HI
Therefore, each temperature in Celsius is converted to Fahrenheit for the heat index, which is then converted back into Celsius to keep the whole dataset in metric units.
room_columns = ['R1','R2','R3','R4','R5','R6','R7','R8','R9']
humid_columns = ['H_1','H_2','H_3','H_4','H_5','H_6','H_7','H_8','H_9']
for T, R in zip(room_columns, humid_columns):
    df[T+R] = (c_1 +
               c_2 * (df[T] * 1.8 + 32) +
               c_3 * df[R] +
               c_4 * (df[T] * 1.8 + 32) * df[R] +
               c_5 * (df[T] * 1.8 + 32)**2 +
               c_6 * df[R]**2 +
               c_7 * (df[T] * 1.8 + 32)**2 * df[R] +
               c_8 * (df[T] * 1.8 + 32) * df[R]**2 +
               c_9 * (df[T] * 1.8 + 32)**2 * df[R]**2 - 32) / 1.8
For the outside readings, wind chill also has to be factored in, which the inside rooms do not experience: https://www.calculator.net/wind-chill-calculator.html. The Australian apparent temperature equation combines temperature, humidity, and wind speed, all in metric units, into one overall heat and wind chill index: https://en.wikipedia.org/wiki/Wind_chill#Australian_Apparent_Temperature.
outTemp = ['TempOutSide']
outHumid = ['H_OutSide']
outWind = ['Windspeed']
for T, H, W in zip(outTemp, outHumid, outWind):
    df[T+H] = df[T] + 0.33*(df[H]/100*6.105*np.exp((17.27*df[T])/(237.7+df[T]))) - 0.7*df[W] - 4
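A scalar sanity check of the same formula, using made-up outside conditions (20 C, 50% humidity, 3 m/s wind; illustrative values only):

```python
import numpy as np

# Australian apparent temperature for one reading, same formula as above
T, H, W = 20.0, 50.0, 3.0
e = H / 100 * 6.105 * np.exp(17.27 * T / (237.7 + T))  # vapour pressure term
apparent = T + 0.33 * e - 0.7 * W - 4.0
print(round(apparent, 2))
```

Humidity raises the apparent temperature slightly here, while the wind term and the constant offset lower it.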
Now that the engineered features are created, we graph the skewness to see which features need to be adjusted to a log scale. Since the procedure is recursive, we found that total consumption needed the log transform.
df.head().T
When graphing the frequency distributions of the features, most look normal; the decision tree and random forest procedures will carve out piecewise factors and thresholds for better regression prediction.
plt.rcParams['figure.max_open_warning'] = 40
colnames = list(df.select_dtypes(exclude='O').columns.values)
for i in colnames:
    facet = sns.FacetGrid(df, aspect=2)
    facet.map(sns.distplot, i)
    facet.add_legend()
    facet.fig.suptitle("Figure {}: {} Distribution By Label".format(colnames.index(i) + 1, i))
    plt.show()
When graphing weekends against weekdays, the separation of total consumption over time of day appears random in pattern.
ggplot(df, aes(x='time', y='TotalConsmp*', color='weekend')) + geom_point() + scale_color_gradient(low='#05D9F6', high='#5011D1') \
+ ggtitle("Figure 40: Total Consumption over Time for Weekends and Non-Weekends")
When graphing the difference among months, the separation of total consumption over months is mostly random, though January and other cold months show higher total energy consumption, possibly due to heating.
ggplot(df, aes(x='time', y='TotalConsmp*', color='month')) + geom_point() + scale_color_gradient(low='#05D9F6', high='#5011D1') \
+ ggtitle("Figure 41: Total Consumption over Time for Months")
When graphing temperature outside and total consumption between weekends and weekdays, it is determined that there is a striped pattern that the random forest could recognize when fitting the model.
ggplot(df, aes(x='TempOutSideH_OutSide', y='TotalConsmp*', color='weekend')) + geom_point() + scale_color_gradient(low='#05D9F6', high='#5011D1') \
+ ggtitle("Figure 42: Total Consumption vs Temperature outside for Weekends")
When graphing temperature outside and total consumption among months, the winter months were colder and could slightly influence total consumption.
ggplot(df, aes(x='TempOutSideH_OutSide', y='TotalConsmp*', color='month')) + geom_point() + scale_color_gradient(low='#05D9F6', high='#5011D1') \
+ ggtitle("Figure 43: Total Consumption vs Temperature outside for Months")
When graphing total consumption against the heat index for the various rooms, some rooms are evidently kept at a higher temperature than others, depending on the month and outside temperature. Again, these nuances in the data pattern would be picked up by the decision tree and random forest algorithms.
ggplot(df, aes(x='R1H_1', y='TotalConsmp*', color='month')) + geom_point() + scale_color_gradient(low='#05D9F6', high='#5011D1') \
+ ggtitle("Figure 44: Total Consumption vs Heat Index for Months for Room 1")
ggplot(df, aes(x='R2H_2', y='TotalConsmp*', color='month')) + geom_point() + scale_color_gradient(low='#05D9F6', high='#5011D1') \
+ ggtitle("Figure 45: Total Consumption vs Heat Index for Months for Room 2")
ggplot(df, aes(x='R3H_3', y='TotalConsmp*', color='month')) + geom_point() + scale_color_gradient(low='#05D9F6', high='#5011D1') \
+ ggtitle("Figure 46: Total Consumption vs Heat Index for Months for Room 3")
ggplot(df, aes(x='R4H_4', y='TotalConsmp*', color='month')) + geom_point() + scale_color_gradient(low='#05D9F6', high='#5011D1') \
+ ggtitle("Figure 47: Total Consumption vs Heat Index for Months for Room 4")
ggplot(df, aes(x='R5H_5', y='TotalConsmp*', color='month')) + geom_point() + scale_color_gradient(low='#05D9F6', high='#5011D1') \
+ ggtitle("Figure 48: Total Consumption vs Heat Index for Months for Room 5")
ggplot(df, aes(x='R6H_6', y='TotalConsmp*', color='month')) + geom_point() + scale_color_gradient(low='#05D9F6', high='#5011D1') \
+ ggtitle("Figure 49: Total Consumption vs Heat Index for Months for Room 6")
ggplot(df, aes(x='R7H_7', y='TotalConsmp*', color='month')) + geom_point() + scale_color_gradient(low='#05D9F6', high='#5011D1') \
+ ggtitle("Figure 50: Total Consumption vs Heat Index for Months for Room 7")
ggplot(df, aes(x='R8H_8', y='TotalConsmp*', color='month')) + geom_point() + scale_color_gradient(low='#05D9F6', high='#5011D1') \
+ ggtitle("Figure 51: Total Consumption vs Heat Index for Months for Room 8")
ggplot(df, aes(x='R9H_9', y='TotalConsmp*', color='month')) + geom_point() + scale_color_gradient(low='#05D9F6', high='#5011D1') \
+ ggtitle("Figure 52: Total Consumption vs Heat Index for Months for Room 9")
As stated above, decision trees and random forests are more lenient with outliers: the threshold system places extreme values into their own groupings rather than letting them pull on the overall fit, as they would in linear or polynomial models. In some cases outliers are even preferred, because trees can pick up on nuances in the dataset that continuous-equation models would miss during fitting.
Due to the recursive nature of the process, dummy variables were made for months and days of the week but were then removed because they had no influence on the overall model. The code is kept intact in case we want to go back and explore.
# colnames = ["month", "weekday"]
# for i in colnames:
#     # Fill missing data with the word "Missing"
#     df[i].fillna("Missing", inplace=True)
#     # Create array of dummies
#     dummies = pd.get_dummies(df[i], prefix=i)
#     # Update X to include dummies and drop the main variable
#     df = pd.concat([df, dummies], axis=1)
#     df.drop([i], axis=1, inplace=True)
There are several ways to index the target and independent features
y = df.pop('TotalConsmp*')
# ids = pd.Series(y.unique())
# ids = ids.reset_index().set_index(0)
# y_int = ids.loc[y]
Instead of getting a 'lucky sample' from one arbitrary split, the k-fold method shuffles and splits the data into 3 training/testing folds, so the model learns from every fold and is not overfitted to a single 80:20 split.
# x_train, x_test, y_train, y_test = train_test_split(df, y, test_size=0.33, random_state=2)
kf = KFold(n_splits = 3, shuffle = True, random_state = 2)
# print('Training Features Shape:', x_train.shape)
# print('Training Target Shape:', y_train.shape)
# print('Testing Features Shape:', x_test.shape)
# print('Testing Target Shape:', y_test.shape)
Through the recursive modeling process, decision trees and random forests did best on mean absolute error and accuracy. R^2 was not used to evaluate performance because of how amorphous the data pattern is when graphed. Because the errors are in decimals, mean absolute error was used over mean squared error.
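The two numbers reported below, mean absolute error on the log scale and an "accuracy" defined as 100 minus the mean absolute percentage error, can be factored into a small helper (a sketch mirroring the inline computation in the loops that follow; the example values are made up):

```python
import numpy as np

def report(y_true, y_pred):
    """MAE plus a MAPE-based 'accuracy' (100 - mean absolute % error)."""
    y_true = np.asarray(y_true, dtype=float)
    error = np.abs(np.asarray(y_pred, dtype=float) - y_true)
    mae = np.mean(error)
    accuracy = 100 - np.mean(100 * error / y_true)
    return mae, accuracy

mae, acc = report([4.0, 5.0, 6.0], [4.2, 4.8, 6.0])
print(round(mae, 2), round(acc, 1))  # 0.13 97.0
```

Note the accuracy metric divides by the true values, so it assumes a strictly positive target, which holds here since the target is log(Wh + 1) of nonnegative consumption.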
model2 = DecisionTreeRegressor()
for train_idx, test_idx in kf.split(df, y):
    model2.fit(df.loc[train_idx], y.loc[train_idx])
    error = abs(model2.predict(df.loc[test_idx]) - y.loc[test_idx])
    print('Mean Absolute Error:', round(np.mean(error), 2), 'log Wh.')
    print('Accuracy:', round(100 - np.mean(100 * (error / y.loc[test_idx])), 2), '%.')
model3 = RandomForestRegressor(random_state=2)
for train_idx, test_idx in kf.split(df, y):
    model3.fit(df.loc[train_idx], y.loc[train_idx])
    error = abs(model3.predict(df.loc[test_idx]) - y.loc[test_idx])
    print('Mean Absolute Error:', round(np.mean(error), 2), 'log Wh.')
    print('Accuracy:', round(100 - np.mean(100 * (error / y.loc[test_idx])), 2), '%.')
With the baseline tests, the random forest regressor did slightly better for both metrics and will be further used to determine the best parameters for this algorithm.
model4 = RandomForestRegressor(n_estimators=100, random_state=2)
for train_idx, test_idx in kf.split(df, y):
    model4.fit(df.loc[train_idx], y.loc[train_idx])
    error = abs(model4.predict(df.loc[test_idx]) - y.loc[test_idx])
    print('Mean Absolute Error:', round(np.mean(error), 2), 'log Wh.')
    print('Accuracy:', round(100 - np.mean(100 * (error / y.loc[test_idx])), 2), '%.')
feature_list = list(df.columns)
# Import tools needed for visualization
from sklearn.tree import export_graphviz
import pydot
# Pull out one tree from the forest
tree = model4.estimators_[5]
# Export the image to a dot file
export_graphviz(tree, out_file = 'tree.dot', feature_names = feature_list, rounded = True, precision = 1)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
graph.write_png('tree.png')
The complexity of one tree from the random forest is shown below; this structure is more reliable here than linear or polynomial regression.
from IPython.display import Image
Image("tree.png")
# Limit depth of tree to 3 levels; reuses the last KFold training split from the loop above
rf_small = RandomForestRegressor(n_estimators=10, max_depth=3)
rf_small.fit(df.loc[train_idx], y.loc[train_idx])
# Extract the small tree
tree_small = rf_small.estimators_[5]
# Save the tree as a png image
export_graphviz(tree_small, out_file = 'small_tree.dot', feature_names = feature_list, rounded = True, precision = 1)
(graph, ) = pydot.graph_from_dot_file('small_tree.dot')
graph.write_png('small_tree.png');
This embedded image generated from the model shows, on a small scale, the mean squared error as time, room temperature, and heat index are partitioned into their own regression values.
Image("small_tree.png")
Time of day, certain room temperatures, heat index, and days since the beginning of 2016 proved to be the most important features for determining total energy consumption. Because the engineered heat index and apparent temperature are composites, they are included alongside their original temperature, humidity, and wind constituents, since some rooms may be used more by people while others house appliances, running warmer or colder depending on the outside temperature. Even though pressure has no feature-engineering equation, it was surprising that it landed in the top 5 important features [Figure 53].
feature_importances = pd.Series(model4.feature_importances_, index=df.columns)
feature_importances.sort_values().plot(kind="barh", figsize=(10,20),
title = "Figure 53: Order of Important Features for Random Forest Regressor");
feature_importances.sort_values(ascending=False)[:35]
feature_importances.sort_values(ascending=False)[-3:]
The features that contribute the least for the winning random forest model are Room 9's temperature, the weekend boolean, and the month. Room 9's effect on total energy is driven more by its humidity; the weekend flag adds little beyond the full 1-to-7 weekday spectrum; and month is redundant with days since the start of 2016, as well as collinear with outside temperature.
# x_test.drop(feature_importances.sort_values(ascending=False)[-3:].index, axis=1, inplace=True)
# x_train.drop(feature_importances.sort_values(ascending=False)[-3:].index, axis=1, inplace=True)
df.drop(feature_importances.sort_values(ascending=False)[-3:].index, axis=1, inplace=True)
After dropping the least important features (importance of 0.01 or lower), the mean absolute error improved and accuracy increased, possibly because those features caused overfitting and the model became better once they were removed.
model = RandomForestRegressor(n_estimators=100, random_state=2)
for train_idx, test_idx in kf.split(df, y):
    model.fit(df.loc[train_idx], y.loc[train_idx])
    error = abs(model.predict(df.loc[test_idx]) - y.loc[test_idx])
    print('Mean Absolute Error:', round(np.mean(error), 2), 'log Wh.')
    print('Accuracy:', round(100 - np.mean(100 * (error / y.loc[test_idx])), 2), '%.')
Time of day is the most important factor in determining total energy consumption, so the best way to visualize the predictions is to plot time of day on the x-axis against total consumption on the y-axis.
feature_importances = pd.Series(model.feature_importances_, index=df.columns)
feature_importances.sort_values().plot(kind="barh", figsize=(10,10),
title = "Figure 54: Order of Features for Random Forest Regressor");
Had I had more time to tune hyperparameters, I would have used more estimators for better accuracy and mean absolute error. I used grid-search cross-validation with k-fold to make sure the model learns from every iteration of the fitting process and avoids overfitting to a 'lucky sample'.
hyperparameters = {'min_samples_split': [2,3,4],
'max_features': ['auto', 'sqrt', 'log2'],
'criterion': ['mse']
}
rfr = RandomForestRegressor(random_state=2,
#n_estimators=100,
#n_jobs=-1
)
search = GridSearchCV(rfr, hyperparameters, cv=5, scoring = 'neg_mean_squared_error')
search.fit(df.loc[train_idx],y.loc[train_idx])
search.best_estimator_
With the hyperparameters tuned to max features of square root, a min samples split of 4, and a mean squared error criterion, the number of estimators is increased for a more accurate model under the k-fold process. The model is 94.8% accurate with a mean absolute error of 0.23 log Wh.
rfr = RandomForestRegressor(criterion='mse',
n_estimators=100,
#n_jobs=-1,
random_state=2,
max_features='sqrt',
min_samples_split=4)
for train_idx, test_idx in kf.split(df, y):
    rfr.fit(df.loc[train_idx], y.loc[train_idx])
    error = abs(rfr.predict(df.loc[test_idx]) - y.loc[test_idx])
    print('Mean Absolute Error:', round(np.mean(error), 2), 'log Wh.')
    print('Accuracy:', round(100 - np.mean(100 * (error / y.loc[test_idx])), 2), '%.')
Due to the scattered nature of the dataset, time of day is the best single feature to plot against total consumption for a pattern to emerge; the remaining features matter less. As shown, nighttime energy consumption is low from 11 pm until 7 am, there is a hump in consumption from 7 am to 3 pm, and the dip at 3 pm is followed by another hump from 4 pm to 11 pm.
plt.figure(figsize = (30,20))
# Plot the actual values
plt.plot(df['time'], y, 'b-', label = 'actual')
# Plot the predicted values
plt.plot(df['time'], rfr.predict(df), 'ro', label = 'prediction')
plt.legend()
plt.minorticks_on()
# Graph labels
plt.xlabel('Time of Day'); plt.ylabel('Total Consumption (Wh Log)'); plt.title('Actual vs Predicted Total Consumption (Log)');
In summary: the target variable is total consumption in watt-hours, predicted from the day, time, month, and the temperature and humidity of 9 rooms plus the outside, along with outside pressure, visibility, and wind speed. Decision tree and random forest regressors were chosen because they handle this relatively sparse dataset better than linear or polynomial regression and are lenient with outliers, grouping extreme values by thresholds instead of letting them skew a continuous fit.
For feature engineering, the heat index combined each room's humidity and temperature into one overall temperature, and the Australian apparent temperature equation combined outside temperature, humidity, and wind speed into one heat and wind chill index, on the assumption that humans occupy the rooms and the "feeling" of temperature drives heating.
Rather than risk a 'lucky sample' from one arbitrary split, 3-fold shuffled cross-validation let the model learn from every fold. Time of day, certain room temperatures, heat index, and days since the beginning of 2016 proved to be the most important features; surprisingly, pressure also made the top 5. Dropping features with importance of 0.01 or lower improved both mean absolute error and accuracy, and time of day proved the best axis against which to visualize the predictions.
With hyperparameters tuned to max features of square root, a min samples split of 4, and a mean squared error criterion, and the number of estimators increased, the final model is 94.8% accurate with a mean absolute error of 0.23 log Wh. Plotting predictions against time of day shows low consumption from 11 pm to 7 am, a hump from 7 am to 3 pm, a dip at 3 pm, and another hump from 4 pm to 11 pm.
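As a compact recap, the final recipe (log-transformed target, 3-fold shuffled cross-validation, tuned random forest) can be sketched end to end with the current sklearn API. The feature matrix here is a synthetic stand-in, since the real one comes from the preparation steps above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Synthetic stand-in for the prepared features and Wh target
rng = np.random.RandomState(2)
X = rng.uniform(size=(300, 5))
y_wh = 50 + 200 * X[:, 0] + 10 * rng.normal(size=300)  # watt-hours
y = np.log(y_wh + 1)                                   # log target, as above

kf = KFold(n_splits=3, shuffle=True, random_state=2)
model = RandomForestRegressor(n_estimators=100, max_features='sqrt',
                              min_samples_split=4, random_state=2)

fold_maes = []
for train_idx, test_idx in kf.split(X):
    model.fit(X[train_idx], y[train_idx])
    error = np.abs(model.predict(X[test_idx]) - y[test_idx])
    fold_maes.append(np.mean(error))

print('Mean Absolute Error per fold:', [round(m, 3) for m in fold_maes])
```

With the real dataset, X and y would instead be the engineered dataframe and the popped 'TotalConsmp*' column.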